---
title: Enrich data using Feature Discovery
dataset_name: N/A
description: How Feature Discovery helps you combine datasets of different granularities and perform automated feature engineering.
domain: platform
expiration_date: 10-10-2024
owner: izzy@datarobot.com
url: docs.datarobot.com/docs/tutorials/prep-learning-data/enrich-data-using-feature-discovery.html

---

# Enrich data using Feature Discovery {: #enrich-data-using-feature-discovery }

In this tutorial, you'll learn how Feature Discovery helps you combine datasets of different granularities and perform automated feature engineering.

More often than not, features are split across multiple data assets. Bringing these data assets together can take a lot of work&mdash;joining them and then running machine learning models on top. It's even more difficult when the datasets are of different granularities. In this case, you have to aggregate to join the data successfully.

Feature Discovery solves this problem by automating the procedure of joining and aggregating your datasets. After defining how the datasets need to be joined, you leave feature generation and modeling to DataRobot.
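
To make the granularity problem concrete, the following is a minimal sketch of the manual aggregate-then-join work described above, written in plain Python. It is not how DataRobot implements Feature Discovery. The sample rows, the `bought_banana` and `n_items` fields, and the helper names are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: one row per user, several rows per user in orders.
users = [
    {"user_id": 1, "bought_banana": True},
    {"user_id": 2, "bought_banana": False},
]
orders = [
    {"order_id": 10, "user_id": 1, "n_items": 4},
    {"order_id": 11, "user_id": 1, "n_items": 6},
    {"order_id": 12, "user_id": 2, "n_items": 3},
]


def aggregate_orders(order_rows):
    """Collapse order-level rows to one summary row per user."""
    grouped = defaultdict(list)
    for row in order_rows:
        grouped[row["user_id"]].append(row["n_items"])
    return {
        user_id: {"order_count": len(items), "avg_items": sum(items) / len(items)}
        for user_id, items in grouped.items()
    }


def enrich_users(user_rows, order_rows):
    """Left-join the per-user aggregates onto the user-level rows."""
    summary = aggregate_orders(order_rows)
    empty = {"order_count": 0, "avg_items": None}
    return [{**user, **summary.get(user["user_id"], empty)} for user in user_rows]


enriched = enrich_users(users, orders)
```

Each new aggregate (counts, averages, and so on) and each additional table adds more code of this kind, which is the work that Feature Discovery is meant to automate.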

This tutorial uses data taken from Instacart, an online aggregator for grocery shopping. The business problem is to predict whether a customer is likely to purchase a banana.


## Takeaways {: #takeaways }

This tutorial shows how to:

* Add datasets to a project
* Define relationships
* Set join conditions
* Configure time-aware settings
* Review features that are generated during Feature Discovery
* Score models built using Feature Discovery


## Load the datasets to AI Catalog {: #load-the-datasets-to-ai-catalog }

The tutorial uses these datasets:

| Table | Description |
|----|----|
| Users | Information on users and whether or not they bought bananas on particular order dates. |
| Orders | Historical orders made by a user. A User record is joined with multiple Order records. |
| Transactions | Specific products bought by the user in an order. An Order record is joined with multiple Transaction records. |


Each of these tables has a different unit of analysis, which defines the *who* or *what* you're predicting, as well as the level of granularity of the prediction. This tutorial shows how to join the tables together so that you have a suitable unit of analysis that produces good results.

Start by loading the primary dataset&mdash;the dataset containing the target feature you want to predict.

1. Go to the **AI Catalog** and for each dataset you want to upload, click **Add to catalog**.

     ![](images/tu-fd-ai-catalog-add-to-catalog.png)

     You can add the data in various ways, for example, by connecting to a data source or uploading a local file.

2. Once all of your datasets are uploaded, select the dataset you want to be your primary dataset and click **Create project** in the upper right.

     ![](images/tu-fd-ai-catalog-primary.png)

## Add secondary datasets {: #add-secondary-datasets }

Once you upload your datasets to the AI Catalog, you can add the secondary datasets to the primary dataset in the project you created.

1. In the project you created, specify your target, then under **Secondary Datasets**, click **Add datasets**.

    ![](images/tu-fd-add-secondary-dataset.png)

2. On the **Specify prediction point** page of the **Relationship editor**, select the feature that indexes your primary dataset by time under **Select date feature to use as prediction point**. Then click **Set up as prediction point**.

    ![](images/tu-fd-time-index.png)

     In this dataset, the date feature is `time`.

3. On the **Add datasets** page of the **Relationship editor**, select **AI Catalog**.

    ![](images/tu-fd-rel-editor-add-datasets-ai-cat.png)

4. In the **Add datasets** window, click **Select** next to each dataset you want to add, then click **Add**.

    ![](images/tu-fd-select-datasets-to-add.png)

5. Click **Continue** to finalize your selection.

    ![](images/tu-fd-click-continue-rel-editor.png)

## Define relationships {: #define-relationships }

Next, create relationships between your datasets by specifying the conditions for joining the datasets, for example, the columns on which they are joined. You can also configure time-aware settings if needed for your data.

1. On the **Define Relationships** page, click a secondary dataset to highlight it, then click the plus sign that appears at the bottom of the primary dataset tile.

     ![](images/tu-fd-add-relation.png)

2. Set join conditions&mdash;in this case, specify the columns for joining. DataRobot recommends the `user_id` column for the join. Click **Save and configure time-aware**.

     ![](images/tu-fd-specify-columns-for-joining.png)

    ??? tip
        Instead of a single column, you can add a list of features for more complex joining operations. Click **+ join condition** and select features to build complex relationships.

3. Select the time feature of the secondary dataset, set the feature derivation window, and click **Save**.

    ![](images/tu-fd-time-index-and-fdw.png)

    See [Time series modeling](time/index) for details on setting time-aware options.

4. Repeat these steps to add any other secondary datasets.

    In this example, the three datasets are joined with these relationships:

    ![](images/tu-fd-complete-relationship.png)
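
The effect of the join conditions and time-aware settings can be pictured with a small sketch. This is a conceptual illustration in plain Python, not DataRobot's implementation: the `order_time` field, the 30-day window, and the function name are assumptions, and the exact window boundaries DataRobot applies may differ. Only `user_id` and `time` come from this tutorial's datasets.

```python
from datetime import datetime, timedelta


def rows_in_derivation_window(secondary_rows, primary_row, join_keys, window):
    """Return secondary rows that match on every join key and fall inside
    [prediction point - window, prediction point)."""
    prediction_point = primary_row["time"]
    window_start = prediction_point - window
    return [
        row
        for row in secondary_rows
        if all(row[key] == primary_row[key] for key in join_keys)
        and window_start <= row["order_time"] < prediction_point
    ]


# Hypothetical rows; `time` is the prediction point of the primary dataset.
primary_row = {"user_id": 1, "time": datetime(2024, 3, 1)}
orders = [
    {"user_id": 1, "order_time": datetime(2024, 1, 5)},   # older than the window
    {"user_id": 1, "order_time": datetime(2024, 2, 10)},  # inside the window
    {"user_id": 1, "order_time": datetime(2024, 3, 2)},   # after the prediction point
    {"user_id": 2, "order_time": datetime(2024, 2, 15)},  # different user
]

usable = rows_in_derivation_window(
    orders, primary_row, join_keys=["user_id"], window=timedelta(days=30)
)
```

Excluding rows at or after the prediction point is what keeps future information out of the derived features. Passing more than one name in `join_keys` corresponds to adding several join conditions.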

## Build your models {: #build-your-models }

Now that the secondary datasets are in place and DataRobot knows how to join them, you can go back to the project and begin modeling.

1. Click **Continue to project** in the top right.

     ![](images/tu-fd-continue-to-project.png)

     Back on the main **Data** page, you can see under **Secondary Datasets** that two relationships have been defined for the *Orders* secondary dataset and one relationship has been defined for the *Transactions* secondary dataset.

     ![](images/tu-fd-start.png)

2. Click **Start** to begin modeling.

     DataRobot loads the secondary datasets and discovers features:

     ![](images/tu-fd-worker-queue.png)

     In the next section, you'll learn how to analyze the discovered features.


## Review derived features {: #review-derived-features }

DataRobot automatically generates hundreds of features and removes features that might be redundant or have a low impact on model accuracy.

!!! note
    To prevent DataRobot from removing less informative features, turn off supervised feature reduction on the [**Feature Reduction** tab](fd-gen#disable-feature-reduction) of the **Feature Discovery Settings** page.

You can begin reviewing the derived features once EDA2 completes.

1. On the **Data** tab, click a derived feature and view the **Histogram** tab.

     ![](images/tu-fd-histogram.png)

     Derived feature names include the dataset alias and the type of transformation. In this example, the transformation is the unique count of orders by the day of the month.

2. Click the **Feature Lineage** tab to see how this feature was created.

     ![](images/tu-fd-feature-lineage.png)

3. To download the new dataset with the derived features, scroll to the top of the **Data** page, click the **Feature Discovery** tab, click the menu icon on the right, and select **Download dataset**.

     ![](images/tu-fd-download-dataset.png)

4. To understand the process DataRobot used to derive and prune the features, click the menu icon on the right and click **Feature Derivation log**.

     ![](images/tu-fd-select-feature-derivation-log.png)

     The **Feature Derivation Log** shows information about the features processed, generated, and removed, along with the reasons why features were removed. You can optionally save the log by clicking **Download**:

     ![](images/tu-fd-feature-derivation-log.png)
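
As a rough illustration of what a derived feature represents, the sketch below computes one plausible reading of the "unique count by day of month" transformation from step 1: the number of distinct days of the month on which a user placed orders. The interpretation, the function name, and the sample timestamps are assumptions. Refer to the **Feature Lineage** tab for how DataRobot actually derived a given feature.

```python
from datetime import datetime


def unique_days_of_month(order_times):
    """Count the distinct day-of-month values among the given timestamps."""
    return len({timestamp.day for timestamp in order_times})


# Hypothetical order timestamps for a single user.
order_times = [
    datetime(2024, 1, 5),
    datetime(2024, 2, 5),   # same day of month as the first order
    datetime(2024, 2, 17),
]

feature_value = unique_days_of_month(order_times)
```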

## Score models built with Feature Discovery {: #score-models-built-with-feature-discovery }

When scoring models built with Feature Discovery, you need to ensure the secondary datasets are up-to-date and that feature derivation will complete without problems.

To make predictions on models built with Feature Discovery:

1. On the **Models** page, click the **Leaderboard** tab and click the model you selected for deployment.

2. Click **Predict**, then under **Prediction Datasets**, click **Import data from** and import the scoring dataset.

     ![](images/tu-fd-predict-import-data.png)

     The dataset must have the same schema as the dataset used to create the project. The target column is optional, and you don't need to upload secondary datasets at this point.

3. After the dataset is uploaded, click **Compute Predictions**.

     ![](images/tu-fd-compute-predictions.png)

4. To change the default configuration for the secondary datasets, under **Secondary datasets configuration**, click **Change**.

     ![](images/tu-fd-change-secondary-config.png)

    Updating the secondary dataset configuration is necessary if the scoring data has a different time period and is not joinable with the secondary datasets used in the training phase.

5. To add a new configuration, click **create new**.

     ![](images/tu-fd-create-new-config.png)

6. To replace a secondary dataset, in the **Secondary Datasets Configuration** window, locate the secondary dataset and click **Replace**.

     ![](images/tu-fd-replace-secondary-configs.png)

!!! note
    If you need to replace a secondary dataset, do so before uploading your scoring dataset to DataRobot. Otherwise, DataRobot uses the default settings to compute the joins and perform feature derivation.
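
Before uploading a scoring dataset, it can help to check the two requirements above yourself. The following sketch, in plain Python, is one possible pre-flight check and not part of DataRobot: the function names, the `bought_banana` target name, and the 30-day window are hypothetical.

```python
from datetime import datetime, timedelta


def schema_problems(training_columns, scoring_columns, target):
    """List column differences between training and scoring data,
    ignoring the target, which is optional at scoring time."""
    expected = set(training_columns) - {target}
    actual = set(scoring_columns) - {target}
    problems = [f"missing column: {name}" for name in sorted(expected - actual)]
    problems += [f"unexpected column: {name}" for name in sorted(actual - expected)]
    return problems


def uncovered_prediction_points(prediction_points, secondary_times, window):
    """Return prediction points with no secondary record in the preceding window.
    Any hit suggests the secondary dataset needs to be replaced."""
    return [
        point
        for point in prediction_points
        if not any(point - window <= t < point for t in secondary_times)
    ]


# Hypothetical column names and timestamps.
problems = schema_problems(
    training_columns=["user_id", "time", "bought_banana"],
    scoring_columns=["user_id", "time"],
    target="bought_banana",
)
stale = uncovered_prediction_points(
    prediction_points=[datetime(2024, 6, 1)],
    secondary_times=[datetime(2024, 2, 10)],  # secondary data ends months earlier
    window=timedelta(days=30),
)
```

Here `problems` is empty because only the optional target is absent, while `stale` contains the June prediction point because the secondary data stops in February.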

## Learn more {: #learn-more }

See the following documentation topics for detailed information on Feature Discovery:

* [Creating a project for Feature Discovery](fd-overview)
* [Time-aware feature engineering](fd-time)
* [Derived features](fd-gen)
* [Predictions](fd-predict)
